Extended aggregations for databases with referential integrity issues
Authors
Abstract
Querying databases with incomplete or inconsistent content remains a broad and difficult problem. In this work, we study how to improve aggregations computed on databases with referential errors in the context of database integration, where each source database has different tables, columns with similar content across multiple databases, and different referential integrity constraints. Thus, a query in an integrated database may involve tables and columns with referential integrity errors. In a data warehouse, even though the ETL processes fix referential integrity errors, this is generally done by inserting "dummy" records into the dimension tables corresponding to such invalid foreign keys, thereby artificially enforcing referential integrity. When two tables are joined and aggregations are computed, rows with an invalid or null foreign key value are skipped, effectively eliminating potentially valuable information. With that motivation in mind, we extend SQL aggregate functions computed over tables with referential integrity errors to return complete answer sets, in the sense that no row is excluded. We associate with each referenced key in the dimension table a probability that invalid or null foreign keys refer to it. Our main idea is to compute aggregations over joined tables, including rows with invalid or null references, by distributing their contribution to aggregation totals based on probabilities computed over correct foreign keys. Therefore, our extended aggregations can return improved answer sets in databases that violate referential integrity or have referential issues. Experiments with real and synthetic databases evaluate the usefulness, accuracy and performance of our extended aggregations.
Javier García-García and Carlos Ordonez. This is the author's version. Official version published in Elsevier DKE, 69(1):63-95, 2010.
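The distribution idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' SQL implementation. It assumes that the probability attached to each dimension key is simply that key's share of the valid foreign key references, and that the aggregate is a sum; the abstract does not specify either detail. The function name `extended_sum`, the `(foreign_key, amount)` row layout, and the uniform fallback are all hypothetical.

```python
from collections import Counter


def extended_sum(fact_rows, dimension_keys):
    """Sum amounts per dimension key without dropping any fact row.

    fact_rows: iterable of (foreign_key, amount); foreign_key may be None
               or a value that is absent from dimension_keys.
    dimension_keys: list of valid referenced keys.

    Assumption (not from the paper): the probability that an invalid or
    null foreign key refers to key k is the fraction of valid references
    that point to k.
    """
    valid_keys = set(dimension_keys)
    totals = {k: 0.0 for k in dimension_keys}
    valid_counts = Counter()
    orphan_total = 0.0  # contribution of rows with invalid or null keys

    for fk, amount in fact_rows:
        if fk in valid_keys:
            totals[fk] += amount
            valid_counts[fk] += 1
        else:
            orphan_total += amount

    n_valid = sum(valid_counts.values())
    for k in dimension_keys:
        if n_valid > 0:
            # Spread the orphan contribution in proportion to valid references.
            totals[k] += orphan_total * valid_counts[k] / n_valid
        else:
            # Hypothetical fallback: uniform split when no valid reference exists.
            totals[k] += orphan_total / len(dimension_keys)
    return totals


rows = [(1, 10.0), (1, 20.0), (2, 30.0), (None, 40.0), (99, 20.0)]
result = extended_sum(rows, [1, 2, 3])
print(result)
```

In this example the two rows with a null or invalid key carry 60.0 in total. Key 1 holds two of the three valid references and key 2 holds one, so they receive 40.0 and 20.0 respectively, and the grand total over all keys equals the sum over all fact rows, which is the "no row is excluded" property the abstract describes.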
Similar resources
Consistent Aggregations in Databases with Referential Integrity Errors
A data warehouse integrates tables coming from multiple source databases, where each database has different tables, columns with similar content across databases and different referential integrity constraints, enforced to different compliance levels. Some source databases may have more reliable data than others, if referential integrity is more strictly enforced or their respective logical dat...
A Referential Integrity Browser for Distributed Databases
We demonstrate a program that can inspect a distributed relational database on the Internet to discover and quantify referential integrity issues for integration purposes. The program computes data quality metrics for referential integrity at four granularity levels: database, table, column and value, going from a global to a detailed view, exhibiting specific evidence about referential errors....
Semantic Integrity Constraints for Spatial Databases
This paper introduces a formalization of a set of spatial semantic integrity constraints on an extended-relational database model. The formalization extends traditional notions of functional and inclusion dependencies by adding interaction with spatial attributes. This makes it possible to specify implicit and explicit topological conditions between geometries and impose constraints on thematic attribute...
On the Computational Complexity of Minimal-Change Integrity Maintenance in Relational Databases
We address the problem of minimal-change integrity maintenance in the context of integrity constraints in relational databases. Using the framework proposed by Arenas, Bertossi, and Chomicki [4], we focus on two basic computational issues: repair checking (is a database instance a repair of a given database?) and consistent query answers (is a tuple an answer to a given query in every repair of...
Safe Referential
In relational databases, referential integrity constraints express existence dependencies between tuples. Although it is known that certain referential integrity structures may cause data manipulation problems, the nature of these problems has not been explored and the conditions for avoiding them have not been formally developed. In this paper we examine these data manipulation problems and for...
Journal title:
- Data Knowl. Eng.
Volume 69, Issue
Pages -
Publication date: 2010